Title
> "Data Selection and EDA"
- toc: true
- branch: master
- badges: true
- comments: true
- author: Hamel Husain & Jeremy Howard
- categories: [fastpages, jupyter]
import numpy as np
import pandas as pd
import gc
import time
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 200)
app_train = pd.read_csv('application_train.csv')
app_test = pd.read_csv('application_test.csv')
app_train.head()
Feature descriptions (there are too many features, so only a subset is introduced here):
- SK_ID_CURR: unique ID of the current loan
- TARGET: target value (loan repaid / not repaid)
- AMT_INCOME_TOTAL: client income
- AMT_CREDIT: loan amount
- AMT_ANNUITY: monthly loan payment
- AMT_GOODS_PRICE: price of the goods for consumer loans
- EXT_SOURCE_1 ~ EXT_SOURCE_3: normalized external scores
- APARTMENTS_AVG ~ EMERGENCYSTATE_MODE: normalized statistics about the client's residence
- OBS_30_CNT_SOCIAL_CIRCLE: observations of 30-days-past-due in the client's social circle
- DEF_30_CNT_SOCIAL_CIRCLE: defaults on 30-days-past-due in the client's social circle
- OBS_60_CNT_SOCIAL_CIRCLE: observations of 60-days-past-due in the client's social circle
- DEF_60_CNT_SOCIAL_CIRCLE: defaults on 60-days-past-due in the client's social circle
- DAYS_LAST_PHONE_CHANGE: how many days before the application the client changed their phone
- FLAG_DOCUMENT_2 ~ FLAG_DOCUMENT_21: whether each document was provided
- AMT_INSTALMENT: prescribed monthly installment amount
- AMT_PAYMENT: actual monthly payment
app_train.shape, app_test.shape
# A very large number of features and rows...
app_train['TARGET'].value_counts()
# The TARGET distribution is heavily skewed (imbalanced).
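The imbalance above can be quantified with `value_counts(normalize=True)`; a minimal sketch on a toy TARGET series (the values are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy TARGET series with a 9:1 class imbalance (illustrative values only)
target = pd.Series([0] * 9 + [1])

# normalize=True returns class proportions instead of raw counts
ratios = target.value_counts(normalize=True)
print(ratios)  # 0 -> 0.9, 1 -> 0.1
```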
app_train['AMT_CREDIT'].hist()
sns.distplot(app_train['AMT_INCOME_TOTAL'])
sns.boxplot(app_train['AMT_INCOME_TOTAL'])
app_train[app_train['AMT_INCOME_TOTAL'] < 1000000]
app_train[app_train['AMT_INCOME_TOTAL'] < 1000000]['AMT_INCOME_TOTAL'].hist()
sns.distplot(app_train[app_train['AMT_INCOME_TOTAL'] < 1000000]['AMT_INCOME_TOTAL'])
sns.distplot(app_train[app_train['AMT_CREDIT'] < 1000000]['AMT_CREDIT'])
cond1 = (app_train['TARGET'] == 1)
cond0 = (app_train['TARGET'] == 0)
cond_amt = (app_train['AMT_INCOME_TOTAL'] < 500000)
# Create two subplots: a violinplot on the left and a distplot on the right
fig, axs = plt.subplots(figsize=(12, 4), nrows=1, ncols=2, squeeze=False)
# Draw the violin plot on the left subplot.
sns.violinplot(x='TARGET', y='AMT_INCOME_TOTAL', data=app_train[cond_amt], ax=axs[0][0] )
# Draw the histogram on the right subplot.
sns.distplot(app_train[cond0 & cond_amt]['AMT_INCOME_TOTAL'], ax=axs[0][1], label='0', color='blue')
sns.distplot(app_train[cond1 & cond_amt]['AMT_INCOME_TOTAL'], ax=axs[0][1], label='1', color='red')
- Most AMT_INCOME_TOTAL values are concentrated between 100,000 and 150,000.
- The visualization code above is wrapped into a function below.
def show_column_hist_by_target(df, column, is_amt=False):
    cond1 = (df['TARGET'] == 1)
    cond0 = (df['TARGET'] == 0)
    fig, axs = plt.subplots(figsize=(12, 4), nrows=1, ncols=2, squeeze=False)
    # If is_amt is True, filter to values < 500000
    cond_amt = pd.Series(True, index=df.index)
    if is_amt:
        cond_amt = df[column] < 500000
    sns.violinplot(x='TARGET', y=column, data=df[cond_amt], ax=axs[0][0])
    sns.distplot(df[cond0 & cond_amt][column], ax=axs[0][1], label='0', color='blue')
    sns.distplot(df[cond1 & cond_amt][column], ax=axs[0][1], label='1', color='red')
show_column_hist_by_target(app_train, 'AMT_INCOME_TOTAL', is_amt=True)
app_train.shape, app_test.shape
apps = pd.concat([app_train, app_test])
apps.shape
apps['TARGET'].value_counts(dropna=False)
apps.info()
object_columns = apps.dtypes[apps.dtypes == 'object'].index.tolist()
object_columns
apps['CODE_GENDER']
# pd.factorize(categorical Series) returns the label-encoded codes and the unique category values.
# [0] takes only the label-encoded codes.
apps['CODE_GENDER'] = pd.factorize(apps['CODE_GENDER'])[0]
apps['CODE_GENDER']
# pd.factorize() encodes one column at a time, so iterate over the object-typed columns.
for column in object_columns:
    apps[column] = pd.factorize(apps[column])[0]
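What `pd.factorize` returns can be seen on a tiny toy Series (values invented for illustration):

```python
import pandas as pd

# pd.factorize returns (codes, uniques): integer labels plus the unique categories,
# numbered in order of first appearance
codes, uniques = pd.factorize(pd.Series(['M', 'F', 'M', 'XNA']))
print(codes)    # [0 1 0 2]
print(uniques)  # Index(['M', 'F', 'XNA'])
```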
apps.info()
apps.isnull().sum().head(100)
# Fill NaN with the sentinel -999; the test rows' missing TARGET also becomes -999,
# which lets us split train and test back apart below.
apps = apps.fillna(-999)
app_train = apps[apps['TARGET'] != -999]
app_test = apps[apps['TARGET']== -999]
app_train.shape, app_test.shape
app_test = app_test.drop('TARGET', axis=1)
app_test.shape
app_test.dtypes
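The sentinel-based split above can be sketched on a toy combined frame (the values are made up):

```python
import pandas as pd
import numpy as np

# Toy combined frame: test rows have NaN TARGET (illustrative values only)
apps = pd.DataFrame({'SK_ID_CURR': [1, 2, 3, 4],
                     'TARGET': [0.0, 1.0, np.nan, np.nan]})

apps = apps.fillna(-999)                           # missing TARGET becomes the sentinel
train = apps[apps['TARGET'] != -999]               # original train rows
test = apps[apps['TARGET'] == -999].drop('TARGET', axis=1)  # original test rows
print(train.shape, test.shape)  # (2, 2) (2, 1)
```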
ftr_app = app_train.drop(['SK_ID_CURR', 'TARGET'], axis=1)
target_app = app_train['TARGET']
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y = train_test_split(ftr_app, target_app, test_size=0.3, random_state=2021)
train_x.shape, valid_x.shape
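Given the class imbalance seen earlier, passing `stratify` would keep the 0/1 ratio identical in both splits; the notebook itself does not do this, so the sketch below (on toy data) is only a suggested variant:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)   # 9:1 imbalance, like TARGET

# stratify=y preserves the class ratio in both the train and validation sets
tr_x, va_x, tr_y, va_y = train_test_split(X, y, test_size=0.3,
                                          random_state=2021, stratify=y)
print(tr_y.mean(), va_y.mean())  # both 0.1
```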
from lightgbm import LGBMClassifier
clf = LGBMClassifier(
n_jobs=-1,
n_estimators=1000,
learning_rate=0.02,
num_leaves=32,
subsample=0.8,
max_depth=12,
verbose=-1
)
clf.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)],
eval_metric= 'auc', verbose= 100, early_stopping_rounds= 50)
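The `auc` metric used above is the same ROC-AUC that `sklearn.metrics.roc_auc_score` computes from `predict_proba` scores; a minimal sketch on hand-made labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Hand-made labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the probability that a random positive outranks a random negative;
# here 3 of the 4 positive/negative pairs are ranked correctly
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```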
from lightgbm import plot_importance
plot_importance(clf, figsize=(16, 32))
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gc,os,sys
import random
from sklearn.model_selection import KFold, StratifiedKFold
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 300)
pd.set_option('display.max_colwidth', 30)
app_train = pd.read_csv('application_train.csv')
app_test = pd.read_csv('application_test.csv')
def show_hist_by_target(df, columns):
    cond_1 = (df['TARGET'] == 1)
    cond_0 = (df['TARGET'] == 0)
    for column in columns:
        print('column name:', column)
        fig, axs = plt.subplots(figsize=(12, 4), nrows=1, ncols=2, squeeze=False)
        sns.violinplot(x='TARGET', y=column, data=df, ax=axs[0][0])
        sns.distplot(df[cond_1][column], label='1', color='red', ax=axs[0][1])
        sns.distplot(df[cond_0][column], label='0', color='blue', ax=axs[0][1])
columns = ['AMT_INCOME_TOTAL','AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH',
'DAYS_REGISTRATION', 'DAYS_LAST_PHONE_CHANGE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'EXT_SOURCE_1',
'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
show_hist_by_target(app_train, columns)
- AMT_INCOME_TOTAL and AMT_CREDIT show little difference between targets.
- AMT_ANNUITY has slightly more low values when TARGET=1.
- AMT_GOODS_PRICE shows little difference.
- DAYS_BIRTH has relatively more young-age values when TARGET=1.
- DAYS_EMPLOYED has slightly more small values when TARGET=1.
- DAYS_ID_PUBLISH and DAYS_REGISTRATION have slightly more recent values when TARGET=1.
- DAYS_LAST_PHONE_CHANGE shows little difference.
- CNT_FAM_MEMBERS shows no real difference; outliers just make the histograms look different.
- REGION_RATING_CLIENT shows little difference.
- EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3 each show a modest difference.
- The remaining columns all show little difference.
- Overall, younger clients (with shorter employment) and smaller loans have a relatively higher delinquency rate.
object_columns = app_train.dtypes[app_train.dtypes=='object'].index.tolist()
object_columns
def show_count_by_target(df, columns):
    cond_1 = (df['TARGET'] == 1)
    cond_0 = (df['TARGET'] == 0)
    for column in columns:
        fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(18, 4), squeeze=False)
        # Use countplot to show the histogram of category values
        chart0 = sns.countplot(df[cond_0][column], ax=axs[0][0])
        # Rotate the x tick labels 45 degrees since there are many category values
        chart0.set_xticklabels(chart0.get_xticklabels(), rotation=45)
        chart1 = sns.countplot(df[cond_1][column], ax=axs[0][1])
        chart1.set_xticklabels(chart1.get_xticklabels(), rotation=45)
show_count_by_target(app_train, object_columns)
# Use catplot to solve this: with kind="count" it splits the countplot into
# separate subplots according to the values of a given column.
sns.catplot(x="CODE_GENDER",col="TARGET", data=app_train, kind="count")
def show_category_by_target(df, columns):
    for column in columns:
        print('column name :', column)
        chart = sns.catplot(x=column, col="TARGET", data=df, kind="count")
        # Rotate the x tick labels about 65 degrees so the category names do not overlap
        chart.set_xticklabels(rotation=65)
show_category_by_target(app_train, object_columns)
cond_1 = (app_train['TARGET'] == 1)
cond_0 = (app_train['TARGET'] == 0)
cond_f = (app_train['CODE_GENDER'] == 'F')
cond_m = (app_train['CODE_GENDER'] == 'M')
# Male/female ratio over all rows
print(app_train['CODE_GENDER'].value_counts()/app_train.shape[0])
# Male/female ratio when TARGET=1
print(app_train[cond_1]['CODE_GENDER'].value_counts()/app_train[cond_1].shape[0])
# Male/female ratio when TARGET=0
print(app_train[cond_0]['CODE_GENDER'].value_counts()/app_train[cond_0].shape[0])
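The same per-gender comparison can be read off directly as a default rate with one groupby, since the mean of a 0/1 target per group is exactly the share of defaults; a sketch on toy rows (values invented):

```python
import pandas as pd

# Toy applications: TARGET=1 means default (illustrative values only)
df = pd.DataFrame({'CODE_GENDER': ['F', 'F', 'F', 'M', 'M'],
                   'TARGET':      [0,   0,   1,   1,   0]})

# Mean of a 0/1 target per group == default rate per group
rates = df.groupby('CODE_GENDER')['TARGET'].mean()
print(rates)  # F -> 1/3, M -> 1/2
```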
corr_columns = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
'DAYS_EMPLOYED','DAYS_ID_PUBLISH', 'DAYS_REGISTRATION', 'DAYS_LAST_PHONE_CHANGE', 'AMT_INCOME_TOTAL', 'TARGET']
corr = app_train[corr_columns].corr()
plt.figure(figsize=(9, 9))
sns.heatmap(corr, annot=True)
app_train['DAYS_EMPLOYED'].value_counts()
app_train['CODE_GENDER'].value_counts()
# 365243 is the dataset's placeholder for 'not employed'; treat it as missing
app_train['DAYS_EMPLOYED'] = app_train['DAYS_EMPLOYED'].replace(365243, np.nan)
app_train['DAYS_EMPLOYED'].value_counts(dropna=False)
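The effect of the replacement can be checked on a toy series; `replace` turns every sentinel into NaN so downstream statistics and ratios treat those rows as missing:

```python
import pandas as pd
import numpy as np

# Toy DAYS_EMPLOYED values with two sentinel entries (illustrative values)
days = pd.Series([-1200, 365243, -300, 365243])

# Map the placeholder to NaN
days = days.replace(365243, np.nan)
print(days.isnull().sum())  # 2
```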
app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].isnull().sum()
app_train['EXT_SOURCE_1'].value_counts(dropna=False)
app_train['EXT_SOURCE_2'].value_counts(dropna=False)
app_train['EXT_SOURCE_3'].value_counts(dropna=False)
print('### mean ###\n', app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean())
print('### max ###\n',app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].max())
print('### min ###\n',app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].min())
print('### std ###\n',app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std())
apps = pd.concat([app_train, app_test])
print(apps.shape)
apps['APPS_EXT_SOURCE_MEAN'] = apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)
apps['APPS_EXT_SOURCE_STD'] = apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std(axis=1)
#apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APPS_EXT_SOURCE_MEAN', 'APPS_EXT_SOURCE_STD']].head(10)
apps['APPS_EXT_SOURCE_STD'].isnull().sum()
apps['APPS_EXT_SOURCE_STD'] = apps['APPS_EXT_SOURCE_STD'].fillna(apps['APPS_EXT_SOURCE_STD'].mean())
apps['APPS_EXT_SOURCE_STD'].isnull().sum()
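`mean(axis=1)` and `std(axis=1)` skip NaN per row by default, which is why APPS_EXT_SOURCE_STD comes out NaN only when a row has fewer than two non-missing scores; a sketch on toy values:

```python
import pandas as pd
import numpy as np

# Toy EXT_SOURCE values (invented): row 0 complete, rows 1-2 have one value each
ext = pd.DataFrame({'EXT_SOURCE_1': [0.2, np.nan, np.nan],
                    'EXT_SOURCE_2': [0.4, 0.6,    np.nan],
                    'EXT_SOURCE_3': [0.6, np.nan, 0.5]})

row_mean = ext.mean(axis=1)  # NaNs are skipped per row
row_std = ext.std(axis=1)    # std needs at least two non-NaN values per row
print(row_mean.tolist())            # approx [0.4, 0.6, 0.5]
print(row_std.isnull().tolist())    # [False, True, True]
```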
apps['APPS_ANNUITY_CREDIT_RATIO'] = apps['AMT_ANNUITY']/apps['AMT_CREDIT']
apps['APPS_GOODS_CREDIT_RATIO'] = apps['AMT_GOODS_PRICE']/apps['AMT_CREDIT']
apps['APPS_CREDIT_GOODS_DIFF'] = apps['AMT_CREDIT'] - apps['AMT_GOODS_PRICE']
apps['APPS_ANNUITY_INCOME_RATIO'] = apps['AMT_ANNUITY']/apps['AMT_INCOME_TOTAL']
apps['APPS_CREDIT_INCOME_RATIO'] = apps['AMT_CREDIT']/apps['AMT_INCOME_TOTAL']
apps['APPS_GOODS_INCOME_RATIO'] = apps['AMT_GOODS_PRICE']/apps['AMT_INCOME_TOTAL']
# Disposable-income feature accounting for family size.
apps['APPS_CNT_FAM_INCOME_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['CNT_FAM_MEMBERS']
apps['APPS_EMPLOYED_BIRTH_RATIO'] = apps['DAYS_EMPLOYED']/apps['DAYS_BIRTH']
apps['APPS_INCOME_EMPLOYED_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['DAYS_EMPLOYED']
apps['APPS_INCOME_BIRTH_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['DAYS_BIRTH']
apps['APPS_CAR_BIRTH_RATIO'] = apps['OWN_CAR_AGE'] / apps['DAYS_BIRTH']
apps['APPS_CAR_EMPLOYED_RATIO'] = apps['OWN_CAR_AGE'] / apps['DAYS_EMPLOYED']
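The ratio features above divide by columns that can be zero (or NaN, after the DAYS_EMPLOYED replacement); pandas yields inf for float division by zero, so it can be worth converting inf to NaN afterwards so imputation catches it. A sketch with invented values, reusing the column names from above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'AMT_INCOME_TOTAL': [100000.0, 50000.0],
                   'DAYS_EMPLOYED': [-2000.0, 0.0]})

df['INCOME_EMPLOYED_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_EMPLOYED']
# Division by zero produced +/-inf; convert it to NaN so fillna/imputation handles it
df['INCOME_EMPLOYED_RATIO'] = df['INCOME_EMPLOYED_RATIO'].replace([np.inf, -np.inf], np.nan)
print(df['INCOME_EMPLOYED_RATIO'].tolist())  # [-50.0, nan]
```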
object_columns = apps.dtypes[apps.dtypes == 'object'].index.tolist()
for column in object_columns:
apps[column] = pd.factorize(apps[column])[0]
apps.info()
apps_train = apps[~apps['TARGET'].isnull()]
apps_test = apps[apps['TARGET'].isnull()]
apps_test = apps_test.drop('TARGET', axis=1)
from sklearn.model_selection import train_test_split
ftr_app = apps_train.drop(['SK_ID_CURR', 'TARGET'], axis=1)
target_app = apps_train['TARGET']
train_x, valid_x, train_y, valid_y = train_test_split(ftr_app, target_app, test_size=0.3, random_state=2021)
train_x.shape, valid_x.shape
from lightgbm import LGBMClassifier
clf = LGBMClassifier(
n_jobs=-1,
n_estimators=1000,
learning_rate=0.02,
num_leaves=32,
subsample=0.8,
max_depth=12,
verbose=-1
)
clf.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)], eval_metric= 'auc', verbose= 100,
early_stopping_rounds= 100)
from lightgbm import plot_importance
plot_importance(clf, figsize=(16, 32))
def get_apps_processed(apps):
    # Derived features from EXT_SOURCE_X
    apps['APPS_EXT_SOURCE_MEAN'] = apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)
    apps['APPS_EXT_SOURCE_STD'] = apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std(axis=1)
    apps['APPS_EXT_SOURCE_STD'] = apps['APPS_EXT_SOURCE_STD'].fillna(apps['APPS_EXT_SOURCE_STD'].mean())
    # Ratio features against AMT_CREDIT
    apps['APPS_ANNUITY_CREDIT_RATIO'] = apps['AMT_ANNUITY']/apps['AMT_CREDIT']
    apps['APPS_GOODS_CREDIT_RATIO'] = apps['AMT_GOODS_PRICE']/apps['AMT_CREDIT']
    # Ratio features against AMT_INCOME_TOTAL
    apps['APPS_ANNUITY_INCOME_RATIO'] = apps['AMT_ANNUITY']/apps['AMT_INCOME_TOTAL']
    apps['APPS_CREDIT_INCOME_RATIO'] = apps['AMT_CREDIT']/apps['AMT_INCOME_TOTAL']
    apps['APPS_GOODS_INCOME_RATIO'] = apps['AMT_GOODS_PRICE']/apps['AMT_INCOME_TOTAL']
    apps['APPS_CNT_FAM_INCOME_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['CNT_FAM_MEMBERS']
    # Ratio features against DAYS_BIRTH and DAYS_EMPLOYED
    apps['APPS_EMPLOYED_BIRTH_RATIO'] = apps['DAYS_EMPLOYED']/apps['DAYS_BIRTH']
    apps['APPS_INCOME_EMPLOYED_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['DAYS_EMPLOYED']
    apps['APPS_INCOME_BIRTH_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['DAYS_BIRTH']
    apps['APPS_CAR_BIRTH_RATIO'] = apps['OWN_CAR_AGE'] / apps['DAYS_BIRTH']
    apps['APPS_CAR_EMPLOYED_RATIO'] = apps['OWN_CAR_AGE'] / apps['DAYS_EMPLOYED']
    return apps